iMAP Workflow

Code for implementing iMAP is designed with four bundles of commands wrapped individually in bash driver scripts for performing bioinformatic analysis of microbiome data (Figure 1). The output is then transformed into data structure suitable for conducting exploratory visualization. A progress report is generated at each major analysis step to summarize the output.


Schematic illustration of the iMAP pipeline.


Requirements

The first step is to gather all materials needed for implementing the iMAP pipeline smoothly (Table S1).


Table S1: List of required materials for running iMAP pipeline
Required Description Folder Remarks
iMAP pipeline Bundled scripts for comprehensive microbiome analysis iMAP Link
Hardware Computer with multi-core processor: preferably 64-bit.
Remote Accessory Memory (RAM): 8 GB minimum.
Storage: Tens of gigabytes for small dataset otherwise a few terabytes
Raw data Demultiplexed reads in FASTQ format with primers and barcodes removed data/references
Sample metadata A tab-separated file showing sample identifiers, categorical and numeric variables data/metadata
Mapping file A file that links sample IDs (1st column) to the names of forward (2nd column) and reverse (3rd column) data files
Design files Files that assign samples to a specific variables or other categories
Software
sekit For inspecting rawdata format and simple statistics code Link
FASTQc For creating base call quality score images and statistics code Link
bbmap_bbduk For trimming poor quality reads code Link
multiqc For summarizing FASTQc output Link
Mothur For sequence processing and classifying the sequences and preliminary analysis code Link
Statistical analysis and visualization
R For statistical analysis and visualization Link
Rstudio An IDE (integrated development environment) for R Link
iTOL For display, annotation and management of phylogenetic trees Link
Reference 16S rRNA gene alignments
SILVA (nr) Reference rRNA alignments data/references Link
Reference 16S rRNA gene classifiers
SILVA(no gap) Degapped using degap.seqs function in Mothur data/references Link
RDP Mothur-formatted data/references Link
Greengenes Mothur-formatted data/references Link
EzBioCloud Mothur-formatted data/references Link
Custom classifiesr Any manually built classifiers


Download iMAP repository

git clone https://github.com/tmbuza/iMAP.git
cd iMAP

# OR

curl -LOk https://github.com/tmbuza/iMAP/archive/master.zip
unzip master.zip
mv iMAP-master iMAP
rm -rf master.zip
cd iMAP


# OR


wget --no-check-certificate https://github.com/tmbuza/iMAP/archive/master.zip 
unzip master.zip
mv iMAP-master iMAP
rm -rf master.zip
cd iMAP


Gather required materials

  • Raw data (demultiplexed compressed FASTQ files).
  • Metadata, mothur-formatted mapping files (commonly with extension .design)
  • Install software
  • Download reference databases
# Mac
bash ./code/requirements/iMAP_requirements_mac_driver.bash

# Linux

bash ./code/requirements/iMAP_requirements_linux_driver.bash


Verify required folders and files

bash ./code/requirements/iMAP_checkFiles_driver.bash


Figure S1: Major folders in the iMAP root directory. Folders and files marked with ✅ exist. Missing file marked ❌ must be found by the above script before proceeding.




Bioinformatics analysis


CLI: Command-line-interface

This is basically a method where users sequentially run individual or bundle scripts on CLI (Command -Line_Interface) one at a time. We have bundled workflow-specific scripts into a driver to make the analysis easily implemented on CLI by just a single click.

bash ./code/requirements/iMAP_requirements_mac_driver.bash
bash ./code/requirements/iMAP_checkFiles_driver.bash
bash ./code/preprocessing/iMAP_preprocessing_driver.bash
bash ./code/summarizeFastQC/iMAP_multiqc_driver.bash
bash ./code/mockcommunity/iMAP_mockcommunity_driver.bash
bash ./code/seqprocessing/iMAP_seqprocessing_driver.bash
bash ./code/seqclassification/iMAP_seqclassification_driver.bash
bash ./code/seqerrorrate/iMAP_seqerrorrate_driver.bash
bash ./code/otutaxonomy/iMAP_otutaxonomy_driver.bash
bash ./code/annotation/01_processed_seqs.bash
bash ./code/dataanalysis/iMAP_dataanalysis_demo_driver.bash


Batch mode on CLI

The iMAP_driver.bash is the master driver for running all analyses on CLI at once.

bash ./code/iMAP_driver.bash


Remotely via job scheduling script

Users must create a Portable Batch System (PBS) script that describes cluster resources to be used, parameters for the job and the commands to be executed. The following is a PBS script for running executing iMAP pipeline remotely. Note that you must provide the group allocation name (-A) but this may differ from one system to the other. Google for help just in case.


Individual driver

#!/bin/bash -f

#PBS iMAPtest
#PBS -A group allocation name
#PBS -l nodes=1:ppn=8
#PBS -l walltime=4000:00:00
#PBS -l pmem=20gb
#PBS -j oe
#PBS -o iMAPtest.log
#PBS -m abe
#PBS -M tmb72@psu.edu

cd $PBS_O_WORKDIR

bash ./code/requirements/iMAP_requirements_mac_driver.bash
bash ./code/requirements/iMAP_checkFiles_driver.bash
bash ./code/preprocessing/iMAP_preprocessing_driver.bash
bash ./code/summarizeFastQC/iMAP_multiqc_driver.bash
bash ./code/mockcommunity/iMAP_mockcommunity_driver.bash
bash ./code/seqprocessing/iMAP_seqprocessing_driver.bash
bash ./code/seqclassification/iMAP_seqclassification_driver.bash
bash ./code/seqerrorrate/iMAP_seqerrorrate_driver.bash
bash ./code/otutaxonomy/iMAP_otutaxonomy_driver.bash
bash ./code/annotation/01_processed_seqs.bash
bash ./code/dataanalysis/iMAP_dataanalysis_demo_driver.bash


Batch mode

#!/bin/bash -f

#PBS iMAPtest
#PBS -A group allocation name
#PBS -l nodes=1:ppn=8
#PBS -l walltime=4000:00:00
#PBS -l pmem=20gb
#PBS -j oe
#PBS -o iMAPtest.log
#PBS -m abe
#PBS -M tmb72@psu.edu

cd $PBS_O_WORKDIR

bash code/iMAP_driver.bash



Progress report 1


Metadata profiling


Status of metadata


variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
SampleID 0 0.00 0 0 0 0 character 360
BarcodeSequence 0 0.00 360 100 0 0 character 0
ForwardPrimerSequence 0 0.00 360 100 0 0 character 0
ReversePrimerSequence 0 0.00 360 100 0 0 character 0
ForwardRead 0 0.00 0 0 0 0 character 360
ReverseRead 0 0.00 0 0 0 0 character 360
origGroup 0 0.00 0 0 0 0 character 360
Sex 0 0.00 0 0 0 0 character 2
Time 0 0.00 0 0 0 0 character 2
DayID 0 0.00 0 0 0 0 character 35
DPW 12 3.33 0 0 0 0 integer 35
Description 0 0.00 0 0 0 0 character 1

Key: q_zeros: quantity of zeros or missing data ; q_na: quantity of NA; q_inf: quantity of infinite values; type: factor, character, integer or numeric..; unique: quantity of unique values. Percentages are indicated by p_.

When reviewing metadata status you may notice that numeric variables are confused with numeric data. For example, variable DPW (days post weaning) on Table x shows 12 missing values (3.33%) which is incorrect. As investigators we know that day 1 was coded zero (0) (12 zeros for 12 mouse) which in descriptive statistics is interpreted as missing data. To correct this we re-coded the samples with unique identifies shown as DayID to distinguish them from integers. However, depending of experiment this kind of variables need to be converted to character during analysis.


Subset of the metadata for mapping purposes


First 5 lines of mapping metadata

SampleID origGroup Sex Time DayID DPW
F3D000 F3D0 Female Early D000 0
F3D001 F3D1 Female Early D001 1
F3D002 F3D2 Female Early D002 2
F3D003 F3D3 Female Early D003 3
F3D005 F3D5 Female Early D005 5


Last 5 lines of mapping metadata

SampleID origGroup Sex Time DayID DPW
M6D148 M6D148 Male Late D148 148
M6D149 M6D149 Male Late D149 149
M6D150 M6D150 Male Late D150 150
M6D175 M6D175 Male Late D175 175
M6D364 M6D364 Male Late D364 364


Frequency of experimental variables


Figure 3. Frequency of experimental variables



Posible questions


At metadata profiling step

  • QN1: Are there samples to be removed from the analysis based on their frequency?
  • QN2: Are there enough variables for downstream diversity and statistical analysis?
  • QN3: …….?
  • QN4: …….?
  • QN5: …….?





Progress report 2


Pre-processing of paired-reads


Read count


Number of forward reads before QC (Original_R1):     3634461 ( 100 % )
Number of reverse reads before QC (Original_R2):     3634461 ( 100 % )
Total number of reads before QC:     7268922


Read length


Figure x: ensity and histograms plots showing forward and reverse read length. Dotted line indicates mean value for the specified variable



Distribution of raw reads


Barplots


Grouped by Sex variable

Figure x: Barplots of pre-processed reads grouped by sex variable.


Grouped by Time variable

Figure x: Barplots of pre-processed reads grouped by time variable


Grouped by days-post-weaning (D) variable

Figure x: Barplots of pre-processed reads grouped by time variable


Base call quality


Original base call quality (qc0)


open ./results/multiqc/qc0/multiqc_report.html
open ./results/multiqc/qc0/R1/multiqc_report.html
open ./results/multiqc/qc0/R2/multiqc_report.html


Figure x: Example of report generated by running open ./results/multiqc/qc0/multiqc_report.html on CLI


Base call quality after trimming at Q = 25 (qctrim25)


open ./results/multiqc/qctrim25/multiqc_report.html
open ./results/multiqc/qctrim25/R1/multiqc_report.html
open ./results/multiqc/qctrim25/R2/multiqc_report.html


Figure x: Example of report generated by running open ./results/multiqc/qced/R2/multiqc_report.html on CLI


Base call quality after removing phiX


open ./results/multiqc/qced/multiqc_report.html
open ./results/multiqc/qced/R1/multiqc_report.html
open ./results/multiqc/qced/R2/multiqc_report.html


Figure x: Example of report generated by running open ./results/multiqc/qced/R2/multiqc_report.html on CLI


Status of pre-processed reads


variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
SampleID 0 0 0 0 0 0 character 360
origGroup 0 0 0 0 0 0 character 360
Sex 0 0 0 0 0 0 character 2
Time 0 0 0 0 0 0 character 2
DayID 0 0 0 0 0 0 character 35
Original_R1 0 0 0 0 0 0 numeric 357
TrimQ25_R1 0 0 0 0 0 0 numeric 357
NophiX_R1 0 0 0 0 0 0 numeric 359
Original_R2 0 0 0 0 0 0 numeric 357
TrimQ25_R2 0 0 0 0 0 0 numeric 357
NophiX_R2 0 0 0 0 0 0 numeric 359

Key: q_zeros: quantity of zeros or missing data ; q_na: quantity of NA; q_inf: quantity of infinite values; type: factor, character, integer or numeric..; unique: quantity of unique values. Percentages are indicated by p_.


Distribution of pre-processed reads


Stacked barplot


Figure x: Stacked barplots of pre-processed reads grouped by QC variable


Boxplots


Figure x: Boxplot of pre-processed reads plotted with overplotting points (A) and jitter points (B).




Grouped by variable



Summary of pre-processed reads


Total read count


  • Sum of reads in all samples
Number of forward reads before QC (Original_R1):     3634461 ( 100 % )
Number of reverse reads before QC (Original_R2):     3634461 ( 100 % )
Number of forward reads after trimming at Q=25 (TrimQ25_R1):     3631940 ( 99.93064 % )
Number of reverse reads after trimming at Q=25 (TrimQ25_R2):     3631940 ( 99.93064 % )
Number of forward reads after phiX removal (NophiX_R1):  3631769 ( 99.92593 % )
Number of reverse reads after phiX removal (NophiX_R2):  3631769 ( 99.92593 % )


Descriptive statistics


Original_R1 TrimQ25_R1 NophiX_R1 Original_R2 TrimQ25_R2 NophiX_R2
Min. : 14 Min. : 14 Min. : 14 Min. : 14 Min. : 14 Min. : 14
1st Qu.: 5368 1st Qu.: 5365 1st Qu.: 5365 1st Qu.: 5368 1st Qu.: 5365 1st Qu.: 5365
Median : 8001 Median : 7996 Median : 7996 Median : 8001 Median : 7996 Median : 7996
Mean :10096 Mean :10089 Mean :10088 Mean :10096 Mean :10089 Mean :10088
3rd Qu.:13636 3rd Qu.:13630 3rd Qu.:13630 3rd Qu.:13636 3rd Qu.:13630 3rd Qu.:13630
Max. :40113 Max. :40077 Max. :40077 Max. :40113 Max. :40077 Max. :40077


Posible questions


At read inspection step

  • QN1: Are there samples to be removed from the analysis based on read depth?
  • QN2: Are there samples to be removed from the analysis based on read length?
  • QN3: …….?
  • QN4: …….?


At base call quality checking step

  • QN1: Are there samples to be removed from the analysis based on read quality?
  • QN2: What are the minimum trimming parameters?
  • QN3: Is trimming both ends necessary?
  • QN4: Was phiX control used in the experiment, if so should removal of phiX contamination necessary?





Progress report 3


Sequence Processing and Conserved Taxonomy Assignment

The clean paired-end sequences that passed the preprocessing trimming and filtering were processed and classified using mothur-based phylotype and OTU-based approaches as described [@Schloss2009]. This step enables investigators to review merged sequences to determine if expected length is in-line with the targeted 16S rRNA gene region. Lack of good overlapping region could results into too many sequences that are wrong and the effect will be propagated in the downstream analysis. High quality sequences align in the same region when searched against a reference template.


Descriptive statistics


Sequence assembly


Length Overlap_Length MisMatches
Min. : 14 Min. : 0.0 Min. : 0.000
1st Qu.:252 1st Qu.:102.0 1st Qu.: 0.000
Median :252 Median :153.0 Median : 0.000
Mean :241 Mean :137.7 Mean : 1.385
3rd Qu.:253 3rd Qu.:173.0 3rd Qu.: 0.000
Max. :500 Max. :250.0 Max. :116.000


Sequence alignment

QueryLength PairwiseAlignmentLength SimBtwnQuery&Template
Min. : 16.0 Min. : 1.0 Min. : 50.00
1st Qu.:252.0 1st Qu.:252.0 1st Qu.: 96.05
Median :253.0 Median :253.0 Median : 98.41
Mean :244.4 Mean :244.5 Mean : 96.19
3rd Qu.:253.0 3rd Qu.:253.0 3rd Qu.: 99.21
Max. :310.0 Max. :315.0 Max. :100.00


Distribution of processed sequences


  • Bar plots
  • Box plots
  • Density plots
  • Histogram plots


Stacked barplot


Figure x: Stacked barplots of processed sequences grouped by processing variable


Boxplots grouped by processing variable



Boxplots grouped by sex and time variable



Combined distribution plots





Assignment of conserved taxonomy to OTUs



In summary the taxonomy data was generated with high stringent quality control parameters. Initially, 2,920,782 quality-checked sequences were classified at 80% identity against 16S rRNA gene sequences deposited in version 132 of SILVA rRNA database as described in the previous section. In OTUbased approach in particular, the clustering done by opticlust algorithm in-built in-built in mothur yielded reliable output based on the high metrics generated and a low FDR = 0.002 obtained (Table 1).


Table x: Statistical parameters calculated from the OTU-based classification method
Parameter Value
cutoff 0.2
Sensitivity 0.998
Specificity 0.999
PPV 0.998
NPV 0.999
Accuracy 0.999
MCC 0.997
F1score 0.998
FDR 0.002


Posible questions


At sequence processing

  • QN1: Do the paired sequences overlap as expected?
  • QN2: Are there sequences to be removed from the analysis based on sequence length?
  • QN3: Does the dorminant length match the targeted region of 16S rRNA gene?
  • QN4: Are the representative sequences too few or too many? Note that too many non-redundant sequences may results from poor overlapping between forward and reverse reads. Review before proceeding to avoid misleading conclusions.


At sequence classification

  • QN1: What taxonomic classifier should be used? Think of need to develop and use custom classifier that suits the study objectives.
  • QN2: Are there sequences assigned to non-bacterial taxonomic lineages?
  • QN3: Should the matches to non-bacterial lineages be removed from further analysis?
  • QN4: …….?


At error estimation

  • QN1: Is the error too high to cause rejection of the results results?
  • QN2: …….?
  • QN3: …….?
  • QN4: …….?


At OTU clustering and conserved taxonomy assignment

  • QN1: What cutoff should be used?
  • QN2: …….?
  • QN3: …….?
  • QN4: …….?




Progress report 4


OTU and taxon abundance


  • Integrating output obtained from diverse methods could be a robust way for profiling complex microbial communities present in a given sample.
  • The phylotype-based approach yielded 197 OTUs at genus level
  • The OTU-based approach generated 11,257 OTU clusters at 97% identity.
  • The phylogeny approach generated 58,929 tree nodes which were taxonomically classified at 80% identity.
  • Manual review and annotation should lead to actual non-redundant results.


Transformed and annotated abundances

Figure x: Unfiltered and curated OTU abundance. Visual representation of taxon terms highlight the most abundant taxon based on frequency of being assigned to an OTU or tree nodes. Muribaculaceae is the most frequently assigned family and Muribaculaceae_ge is the most frequent species assigned to most sequences.


Most abundant species (top 10%)



OTUs
 [1] "Otu001" "Otu002" "Otu003" "Otu004" "Otu005" "Otu006" "Otu007"
 [8] "Otu008" "Otu010" "Otu011" "Otu012" "Otu013" "Otu014" "Otu016"
[15] "Otu017" "Otu019" "Otu020" "Otu021" "Otu022" "Otu023" "Otu024"
[22] "Otu026" "Otu028" "Otu029" "Otu031" "Otu032" "Otu037" "Otu038"
[29] "Otu041" "Otu045"

Phylum
[1] "Bacteroidetes" "Firmicutes"   

Classn
[1] "Bacilli"          "Bacteroidia"      "Clostridia"      
[4] "Erysipelotrichia"

Order
[1] "Anaeroplasmatales"  "Bacteroidales"      "Bifidobacteriales" 
[4] "Clostridiales"      "Erysipelotrichales" "Lactobacillales"   
[7] "Mollicutes_RF39"    "Verrucomicrobiales"

Family
 [1] "Akkermansiaceae"               "Bacteroidaceae"               
 [3] "Bifidobacteriaceae"            "Clostridiaceae_1"             
 [5] "Clostridiales_vadinBB60_group" "Erysipelotrichaceae"          
 [7] "Lachnospiraceae"               "Lactobacillaceae"             
 [9] "Mollicutes_RF39_fa"            "Muribaculaceae"               
[11] "Peptococcaceae"                "Rikenellaceae"                
[13] "Ruminococcaceae"              

Genus
 [1] "Acetatifactor"                    "Akkermansia"                     
 [3] "Alistipes"                        "Anaerotruncus"                   
 [5] "Bacteroides"                      "Bifidobacterium"                 
 [7] "Candidatus_Arthromitus"           "Clostridiales_vadinBB60_group_ge"
 [9] "Lachnoclostridium"                "Lachnospiraceae_NK4A136_group"   
[11] "Lachnospiraceae_UCG.001"          "Lactobacillus"                   
[13] "Mollicutes_RF39_ge"               "Muribaculaceae_ge"               
[15] "Oscillibacter"                    "Roseburia"                       
[17] "Ruminiclostridium"                "Ruminiclostridium_5"             
[19] "Ruminiclostridium_9"              "Ruminococcaceae_ge"              
[21] "Turicibacter"                    


Rank abundance of selected samples


  • Rank abundance curves can be used to display species richness and species evenness across selected samples.

Figure x. Rank abundance of eight selected samples. Package: goeveg



Correlation between species


  • Seven different visualization methods can be used:
    • circle
    • square
    • ellipse
    • number
    • shade
    • color
    • pie
    • mixed
    • R package corrplot.


Figure x. Correlation between species identified at phylum-level. Species are ordered alphabetically (top panel) and heuristically (bottom panel)


Alpha diversity



Species richness at each datapoint

Figure x: Stacked barplots for species richness. The estimated richness (green bars) was calculated using chao calculator and observed ichness (red bars) was calculated using sobs.


Species richness: Boxplots, density plots and histograms

Figure x: Species richness (observed species) displayed by boxplot (A), density plots (B) and histograms (C).


Correlation between species richness and sequence depth

Figure x: Correlation between species richness and sequence depth. Observed species calculated using sobs (A) and estimated species richness by chao calculator (B).


Correlation between species diversity and species richness


Figure x: Species diversity and correlation to species richness. Definitely phylo-diversity (C) correlates well with the species richness.


Species accumulation


  • Methods used
    • Exact
    • Random
    • Collector
    • Rarefaction



Individual species accumulation curves using different methods


Overplot graph of species accumulation curves


Rarefaction and Extrapolation (R/E)


Type 1: Sample-size-based R/E curve


Type 2: Sample completeness curve

Figure x: Species diversity estimates as a function of sample size. Only species with abundance greater or equal to 1 are detected in the sample.


Type 3: Coverage‐based R/E curves



Figure x. Rarefaction and extrapolation curves. Sample-size-based curve (A), sample completeness curve (B), Coverage‐based curves (C).


Shannon diversity index


Inverse Simpson diversity index



Beta diversity


Heatmaps


Phyla


Class


Order


Family


Genus


PAM clustering


  • Partitioning Around Medoids (PAM)
  • Is considered to be the more robust version of K-means.
  • Medoids are representative objects of the cluster.
  • Starts by determining the best number of clusters using factoextra::fviz_nbclust()
  • Method: Silhouette
  • Metric = Euclidean
  • Robust for partitioning data set into clusters of observation.
  • User are required to know the data to indicate the appropriate number of clusters to be produced.
  • Visualize clusters (pam results) using factoextra::fviz_cluster()


Number of best clusters


OTUbased
Number_clusters     Value_Index 
         2.0000         22.3718 
Phylum
Number_clusters     Value_Index 
         2.0000        114.5654 
Class
Number_clusters     Value_Index 
         2.0000         73.9034 
Order
Number_clusters     Value_Index 
           2.00           42.17 
Family
Number_clusters     Value_Index 
          2.000          31.017 
Genus
Number_clusters     Value_Index 
          2.000          24.511 


Visualization of best clusters

Figure x: Optimal number of OTU clusters. The suggested number of best clusters (dotted line) thta could expllain most variation is 2 for OTUs (A), 3 for phylum (B), 3 for class (C), 2 for Order (D), 10 for Family (E) and 2 for Genus (F). A high average silhouette width indicates high quality clustering.


Cluster validation

  cluster size ave.sil.width
1       1  232          0.67
2       2  128          0.25
  cluster size ave.sil.width
1       1  240          0.68
2       2  120          0.24
  cluster size ave.sil.width
1       1  239          0.67
2       2  121          0.22
  cluster size ave.sil.width
1       1  239          0.67
2       2  121          0.22
  cluster size ave.sil.width
1       1  229          0.67
2       2  131          0.19
  cluster size ave.sil.width
1       1  241          0.68
2       2  119          0.29

Figure x: Silhouette plot guided by the best number of clusters. Observations with a large Si (almost 1) are very well clustered. A small Si (around 0) means that the observation lies between two clusters while a negative Si are probably placed in the wrong cluster.



Ordination projections


PCA (Principal Component Analysis)


  • Identifies smaller number of uncorrelated variables (principal components) from a large set of data.
  • Explains the maximum amount of variance with the minimum number of principal components.
  • Missing values are replaced by the column mean
  • Use scree plot to estimate which components explain most of the variability in the data

Scree plot: Analysis of number of suitable complnents

Figure x: Scree plot of PCA. Shows which components explain most of the variability in the data. Over 80% of the variances contained in OTU and taxonomy data are retained by the first two principal components. The first PC explains the maximum amount of variation in the data set.


While PCA is based on Euclidean distances the PCoA is based on the (dis)similarity matrix calculated from OTU abundance data as described earlier. Literally, in any successful PCA or PCoA the first few axes are supposed to capture most of the variation in the input data. NMDS tries to substitute the original distance data with ranks. Unlike the PCA and PCoA the NMDS axes of the ordination are not ordered according to the variance they explain, instead a plot of stress values (a measure of goodness-of-fit) against dimensionality can be used to assess the proper choice of dimensions. Note that stress values >0.2 are generally considered hard to interpret, whereas values <0.1 are good and <0.05 are the better. In any case the inflexion point on scree plots and Shepard plots (stress plots) can be used to guide the selection of a minimum number of dimensions to use in the interpretation of the multidimensional data.


PCoA or PCO: Principal Coordinates Analysis


  • Similar to PCA but is basically an eigen-analysis of a distance or dissimilarity matrix.


Figure x: Principal coordinate analysis ordination using Bray-Curtis dissimilarity matrix. Objects that are ordinated closer together have smaller dissimilarity values than those ordinated further apart. A successful PCoA will capture most of the variation in the (dis)similarity matrix in a few PCoA axes.


NMDS (Non-metric multidimensional scaling)


  • Goodness of fit and Sheperd plot can be used to determine the god or poor fit.
  • Stress values are used to indicate the quality of NMDS ordination, <0.1 are considered fair, <0.05 indicate good fit.
  • Reference


Parameters used for the NMDS


OTUs
----------------------------

Call:
metaMDS(comm = otu.t[, -1], distance = "bray", k = 3, try = 10,      display = c("sites"), choices = c(1, 2), type = "t", shrink = FALSE) 

global Multidimensional Scaling using monoMDS

Data:     wisconsin(sqrt(otu.t[, -1])) 
Distance: bray 

Dimensions: 3 
Stress:     0.1182664 
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling 
Species: expanded scores based on 'wisconsin(sqrt(otu.t[, -1]))' 

Phylum
----------------------------

Call:
metaMDS(comm = phylum.t[, -1], distance = "bray", k = 3, try = 10,      display = c("sites"), choices = c(1, 2), type = "t", shrink = FALSE) 

global Multidimensional Scaling using monoMDS

Data:     wisconsin(sqrt(phylum.t[, -1])) 
Distance: bray 

Dimensions: 3 
Stress:     0.1201002 
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling 
Species: expanded scores based on 'wisconsin(sqrt(phylum.t[, -1]))' 

Class
----------------------------

Call:
metaMDS(comm = class.t[, -1], distance = "bray", k = 3, try = 10,      display = c("sites"), choices = c(1, 2), type = "t", shrink = FALSE) 

global Multidimensional Scaling using monoMDS

Data:     wisconsin(sqrt(class.t[, -1])) 
Distance: bray 

Dimensions: 3 
Stress:     0.1402302 
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling 
Species: expanded scores based on 'wisconsin(sqrt(class.t[, -1]))' 

Order
----------------------------

Call:
metaMDS(comm = order.t[, -1], distance = "bray", k = 3, try = 10,      display = c("sites"), choices = c(1, 2), type = "t", shrink = FALSE) 

global Multidimensional Scaling using monoMDS

Data:     wisconsin(sqrt(order.t[, -1])) 
Distance: bray 

Dimensions: 3 
Stress:     0.143402 
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling 
Species: expanded scores based on 'wisconsin(sqrt(order.t[, -1]))' 

Family
----------------------------

Call:
metaMDS(comm = family.t[, -1], distance = "bray", k = 3, try = 10,      display = c("sites"), choices = c(1, 2), type = "t", shrink = FALSE) 

global Multidimensional Scaling using monoMDS

Data:     wisconsin(sqrt(family.t[, -1])) 
Distance: bray 

Dimensions: 3 
Stress:     0.1343065 
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling 
Species: expanded scores based on 'wisconsin(sqrt(family.t[, -1]))' 

Genus
----------------------------

Call:
metaMDS(comm = genus.t[, -1], distance = "bray", k = 3, try = 10,      display = c("sites"), choices = c(1, 2), type = "t", shrink = FALSE) 

global Multidimensional Scaling using monoMDS

Data:     wisconsin(sqrt(genus.t[, -1])) 
Distance: bray 

Dimensions: 3 
Stress:     0.0001578126 
Stress type 1, weak ties
Two convergent solutions found after 10 tries
Scaling: centring, PC rotation, halfchange scaling 
Species: expanded scores based on 'wisconsin(sqrt(genus.t[, -1]))' 


Figure X. Sherperd and non-metric multidimensional scaling plot. Green oints represent samples and red points represent OTU or species. Similar samples are ordinated together. Stress values are shown at the botthom of ordination plot.


Phylogenetic clustering and annotation


Figure x: Sample Phylip or Newick-formatted tree clustered using the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm. Similar data was used to construct different types of tree including rectangular (A), circular (B) and unrooted (C) to view how samples were clustered.



Posible questions


Alpha diversity

  • QN1: Are the values obtained too sensitive to sampling?
  • QN2: Was the sampling effort sufficient to account for most OTUs present in a sample?
  • QN3: Is there a need to continue with re-sampling?
  • QN4: …….?


Beta diversity

  • QN1: …….?
  • QN2: …….?
  • QN3: …….?
  • QN4: …….?


——————————————————–

More intervention by investigators